Home

Gender Classification via Voice

Jake Whalen

CS 584 Final Project
Fall 2017
Summary

Choosing a Project
Topic? Sports, Beer, Other
Supervised or unsupervised learning?
Data Source: Download, Web Scrape, Social Media
Tools: Python, R, Weka, Tableau, Excel

Choice
Data from Kaggle
Audio Analysis
Supervised Learning
Classification
Machine Learning in Python
Presentation & Report in R Markdown
Excel for results transfer

Goals
Classify the gender of each audio clip's speaker
Learn what audio features best separate genders

Method

Exploration

  1. Read data into R
  2. Ran summary functions on features
  3. Plot the data
  4. Look for patterns and relationships
  5. Determine which features separate the genders best

Classification

  1. Used Scikit-learn in Python
  2. Split the data for training/testing (2/3, 1/3)
  3. Used grid search to identify the best parameters
  4. KNN (K-Nearest Neighbors)
  5. Decision Tree (DT)
  6. Support Vector Machine (SVM)
  7. Logistic Regression (Log R)
  8. Observed classifications
  9. Attempt to improve on initial results
  10. Apply transformations to the data
  11. Refit the models
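The workflow above can be sketched with scikit-learn. The real Kaggle voice data isn't reproduced here, so a synthetic dataset of the same shape (21 features) stands in, and KNN serves as the example estimator:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the voice dataset (21 features, binary labels).
X, y = make_classification(n_samples=300, n_features=21, random_state=0)

# Step 2: 2/3 train, 1/3 test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Step 3: grid search over candidate hyperparameters.
param_grid = {"n_neighbors": [3, 5, 11],
              "weights": ["uniform", "distance"],
              "p": [1, 2]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```

The same split-search-fit loop repeats for each of the models below, swapping in the estimator and its parameter grid.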

Machine Learning

Review

  • Confusion Matrix
  • Overall Accuracy Scores
  • Male Accuracy
  • Female Accuracy
  • ROC & AUC
  • Parameter Influence
  • Fit & Score Times

Overview

Description

Dataset Comments
  • The dataset was created to identify a voice as male or female based on acoustic properties of the voice and speech.
  • It consists of 3,168 recorded voice samples collected from male and female speakers.
  • The voice samples were pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0–280 Hz (the human vocal range).
  • Each sample is represented by 21 features.
  • Source: Voice Gender Data

Definitions

Sample

EDA

Classes

Distributions

Boxplots

Heatmap

Scatter Plot

3D Plot

T Test

KNN

K-Nearest Neighbors >>>

Summary
  • Used untransformed data
  • Better than a naive 50/50 classifier
  • Distance weights outperformed uniform weights
  • p: Manhattan distance (p = 1) produced better cross-validation results
  • algorithm = auto attempts to choose the most appropriate algorithm based on the training values
  • weights = distance weights points by the inverse of their distance, so closer neighbors of a query point have greater influence than neighbors farther away
Best Parameters
  • algorithm = auto, n_neighbors = 11, p = 1, weights = distance
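As a sketch, the reported best parameters plug directly into scikit-learn's KNeighborsClassifier; synthetic data stands in for the voice features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Best parameters found by grid search: Manhattan distance (p=1),
# inverse-distance weighting, 11 neighbors.
knn = KNeighborsClassifier(algorithm="auto", n_neighbors=11,
                           p=1, weights="distance")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```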

Decision Tree

Decision Tree >>>

Summary
  • Used untransformed data
  • meanfun, sp.ent, and IQR account for over 90% of the feature importance
  • Better at identifying males
  • Easiest model to interpret (follow the branches)
  • presort: presort the data to speed up finding the best splits during fitting
  • splitter: the strategy used to choose the split at each node
  • Tree
Best Parameters
  • criterion = gini, max_depth = 21, presort = True, splitter = random
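A sketch with the reported best parameters on synthetic stand-in data; note that the presort option existed in older scikit-learn releases but has since been removed, so it is omitted here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=21,
                              splitter="random", random_state=0)
tree.fit(X_train, y_train)

# feature_importances_ shows which features drive the splits
# (meanfun, sp.ent, and IQR on the real voice data).
print(tree.feature_importances_)
print("test accuracy:", tree.score(X_test, y_test))
```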

SVM

Support Vector Machine >>>

Summary
  • Modified the penalty parameter C to achieve better results
  • Higher penalties achieved better scores
  • Better at classifying males
Best Parameters
  • C = 48
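A sketch of the SVM with the reported high penalty, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# A high penalty parameter C tolerates fewer margin violations.
svm = SVC(C=48)
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```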

Log Reg

Logistic Regression >>>

Summary
  • Untransformed data
  • Best Male accuracy
  • Outperformed Log Reg (Normal)
  • C: Inverse of regularization strength
  • fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
  • penalty: Used to specify the norm used in the penalization
Best Parameters
  • C = 0.7, fit_intercept = True, penalty = l1
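A sketch with the reported best parameters; in scikit-learn the l1 penalty requires a compatible solver such as liblinear, which the newer default solver does not support:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# C is the inverse of regularization strength; l1 needs liblinear (or saga).
logreg = LogisticRegression(C=0.7, fit_intercept=True,
                            penalty="l1", solver="liblinear")
logreg.fit(X_train, y_train)
print("test accuracy:", logreg.score(X_test, y_test))
```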

Random Forest

Random Forest >>>

Summary
  • Best Female accuracy
  • Took longer to fit than the Decision Tree
Best Parameters
  • criterion = entropy, max_depth = 9, n_estimators = 15
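A sketch of the Random Forest with the reported best parameters, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# An ensemble of 15 entropy-criterion trees, each capped at depth 9.
forest = RandomForestClassifier(criterion="entropy", max_depth=9,
                                n_estimators=15, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```

Fitting 15 trees instead of one is why the forest takes longer to fit than the single Decision Tree above.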

KNN (PCA)

K-Nearest Neighbors (PCA) >>>

Summary
  • Best overall accuracy
  • 9 PCA components used
  • Fewer neighbors performed better
Best Parameters
  • algorithm = auto, n_neighbors = 3, p = 1, weights = distance
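The PCA-then-KNN combination sketches naturally as a scikit-learn pipeline, again with synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Project onto 9 principal components, then fit KNN with the best parameters.
model = make_pipeline(
    PCA(n_components=9),
    KNeighborsClassifier(n_neighbors=3, p=1, weights="distance"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```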

SVM (PCA)

Support Vector Machine (PCA) >>>

Summary
  • Improvement over SVM on untransformed data
  • Adjusted penalty parameter C of the error term
  • Achieved best performance at much lower penalty values
Best Parameters
  • C = 10
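The same pipeline pattern with the SVM and its lower best penalty, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# After PCA, a much lower penalty (C=10 vs. C=48) sufficed.
model = make_pipeline(PCA(n_components=9), SVC(C=10))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```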

Log Reg (Normal)

Logistic Regression (Normalized) >>>

Summary
  • Performed worse than logistic regression on untransformed data
  • Decrease in performance due to decrease in Male accuracy
  • Slight improvement in Female accuracy compared to first Log Reg
Best Parameters
  • C = 0.9, fit_intercept = True, penalty = l1
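A sketch of the normalized variant, standardizing features before the fit; as above, the l1 penalty requires the liblinear solver, and synthetic data stands in for the voice features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Standardize to zero mean / unit variance, then fit logistic regression.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.9, fit_intercept=True,
                       penalty="l1", solver="liblinear"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```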

Conclusions

Criteria


Accuracy
  1. KNN (PCA)
  2. Random Forest
  3. Log Regression
Male Accuracy
  1. Log Regression
  2. Log Regression (Normal)
  3. KNN (PCA)
Female Accuracy
  1. Random Forest
  2. KNN (PCA)
  3. Log Regression (Normal)
AUC
  1. Random Forest
  2. Log Regression
  3. Log Regression (Normal)

ROC


Area Under the Curve
  • KNN: 0.8899249
  • Decision Tree: 0.9606488
  • SVM: 0.9611217
  • Log Reg: 0.9961107
  • KNN (PCA): 0.9921023
  • Random Forest: 0.9979454
  • SVM (PCA): 0.9930792
  • Log Reg (Normal): 0.9955755

Fitting Times

Scoring Times

Conclusion

Best Model
Random Forest
  • 2nd highest overall accuracy
  • 1st in Female accuracy
  • Highest Area Under the Curve
  • Decent fitting time
  • Faster scoring time
Improvements
  • Focus on a single method
  • Combine features to create new ones
  • Implement more advanced methods (bagging/boosting)
  • Extract features from raw audio files